Skip to content

Use GetLatestSnapshot in CleanupRecordExists#8594

Merged
colm-mchugh merged 2 commits into
citusdata:mainfrom
sfc-gh-mslot:marcoslot/cleanup-snapshot
Jun 3, 2026
Merged

Use GetLatestSnapshot in CleanupRecordExists#8594
colm-mchugh merged 2 commits into
citusdata:mainfrom
sfc-gh-mslot:marcoslot/cleanup-snapshot

Conversation

@sfc-gh-mslot
Copy link
Copy Markdown
Contributor

DESCRIPTION: Prevent a possible race condition in shard cleanup

systable_beginscan without a snapshot calls GetCatalogSnapshot, which might be cached from a previous call. That means that if a shard move succeeded and the deletion of the cleanup record completed just before the cleaner tries to acquire a lock, CleanupRecordExists might still see the record in its snapshot and shard cleanup will incorrectly proceed.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.73%. Comparing base (f31fe24) to head (3171fce).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8594      +/-   ##
==========================================
+ Coverage   87.99%   88.73%   +0.74%     
==========================================
  Files         288      288              
  Lines       64382    64385       +3     
  Branches     8108     8109       +1     
==========================================
+ Hits        56650    57133     +483     
+ Misses       5381     4911     -470     
+ Partials     2351     2341      -10     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

ScanKeyInit(&scanKey[0], Anum_pg_dist_cleanup_record_id,
BTEqualStrategyNumber, F_INT8EQ, Int64GetDatum(recordId));

/* make sure we can always see deletion after acquiring the operation ID lock */
Copy link
Copy Markdown
Contributor

@colm-mchugh colm-mchugh May 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comment format strays from convention. Maybe add this text to the function comment at line 1179 ?

Copy link
Copy Markdown
Contributor

@colm-mchugh colm-mchugh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR, it looks like it fixes a real race condition by ensuring the catalog is read at a fresh snapshot. Is it possible to add a regress test ? An isolation test where the maintenance daemon contends over pg_dist_cleanup with a shard move/split operation, for example. The main concern is that without regress test(s), a future code change could compromise this fix.

@sfc-gh-mslot
Copy link
Copy Markdown
Contributor Author

Is it possible to add a regress test ? An isolation test where the maintenance daemon contends over pg_dist_cleanup with a shard move/split operation

Since cleanup does not block waiting for the lock, the isolation testing framework cannot intercept it. It would require injection points, but Citus does not use those yet.

@colm-mchugh
Copy link
Copy Markdown
Contributor

colm-mchugh commented May 24, 2026

Since cleanup does not block waiting for the lock, the isolation testing framework cannot intercept it. It would require injection points, but Citus does not use those yet.

How did you encounter this bug ? Curious what prompted the PR. Another test approach is a superuser only GUC that reflects a millisec delay, and enables a test to create a race and assert there's only one drop. But that may be overkill for this PR - possibly requires a TAP test - so for this but can you update the function's comment with fix details before we merge it ? And also a note about a future test using Postgres injection points. It's worth emphasizing since comments are a useful (and in this case, the only) defense again a silent regression.

sfc-gh-mslot and others added 2 commits June 3, 2026 17:18
Expand the function header to record:
  * Why GetLatestSnapshot() is required (post-TryLockOperationId
    visibility of a delete committed by the operation owner just
    before the advisory lock was released);
  * That this helper is not appropriate for callers that want their
    own outer-transaction snapshot semantics;
  * That no automated regression test exists today because cleanup
    uses TryLockOperationId(dontWait=true), which pg_isolationtester
    cannot interleave, and that adopting PostgreSQL's INJECTION_POINT()
    infrastructure (or an equivalent test-only delay GUC) is required
    before such a test can be written;
  * That, until that test infrastructure exists, the GetLatestSnapshot()
    call is load-bearing and must not be reverted to NULL.

This is a follow-up to the snapshot fix in the previous commit and
does not change behavior.
@colm-mchugh colm-mchugh force-pushed the marcoslot/cleanup-snapshot branch from 2d018eb to 3171fce Compare June 3, 2026 17:27
@sfc-gh-mslot
Copy link
Copy Markdown
Contributor Author

@microsoft-github-policy-service agree company="Snowflake"

@sfc-gh-mslot
Copy link
Copy Markdown
Contributor Author

We found it because we use a similar technique in https://github.com/Snowflake-Labs/pg_lake/blob/main/pg_lake_engine/src/cleanup/in_progress_files.c for file cleanup in pg_lake, which happens at a much higher rate than shard moves making it easier to detect the bug. At some point we realized Citus might have a similar bug.

@colm-mchugh colm-mchugh merged commit c41586c into citusdata:main Jun 3, 2026
228 of 230 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants